ABSTRACT 

Objective To create a computable MEDication 
Indication resource (MEDI) to support primary and 
secondary use of electronic medical records (EMRs). 
Materials and methods We processed four public 
medication resources, RxNorm, Side Effect Resource 
(SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. 
We applied natural language processing and ontology 
relationships to extract indications for prescribable, 
single-ingredient medication concepts and all ingredient 
concepts as defined by RxNorm. Indications were coded 
as Unified Medical Language System (UMLS) concepts 
and International Classification of Diseases, 9th edition 
(ICD9) codes. A total of 689 extracted indications were 
randomly selected for manual review for accuracy using 
dual-physician review. We identified a subset of 
medication-indication pairs that optimizes recall while 
maintaining high precision. 

Results MEDI contains 3112 medications and 63 343 
medication-indication pairs. Wikipedia was the largest 
resource, with 2608 medications and 34 911 pairs. For 
each resource, estimated precision and recall, 
respectively, were 94% and 20% for RxNorm, 75% and 
33% for MedlinePlus, 67% and 31% for SIDER 2, and 
56% and 51% for Wikipedia. The MEDI high-precision 
subset (MEDI-HPS) includes indications found within 
either RxNorm or at least two of the three other 
resources. MEDI-HPS contains 13 304 unique indication 
pairs regarding 2136 medications. The mean±SD 
number of indications for each medication in MEDI-HPS 
is 6.22±6.09. The estimated precision of MEDI-HPS is 
92%. 

Conclusions MEDI is a publicly available, computable 
resource that links medications with their indications as 
represented by concepts and billing codes. MEDI may 
benefit clinical EMR applications and reuse of EMR data 
for research. 



INTRODUCTION 

Medication and diagnosis data are vital to clinical 
care and are core features of electronic medical 
records (EMRs). Medications are prescribed to treat 
disease (ie, the medication's intended indication), but 
they can also cause disease (ie, an adverse effect). 
Linking medications with their diagnoses electronic- 
ally could improve evaluating treatment outcomes, 1 2 
assessing healthcare quality, 3 4 and performing clin- 
ical and genomic research by enhancing understand- 
ing of a patient's longitudinal disease and treatment 
record. 5 6 However, medications are not explicitly 
linked to their indications within most EMRs, and 
research into computational resources to enable such 
linkage is limited. In this paper, we integrated four 
medication resources to create a freely available, 
computable MEDication-Indication (MEDI) 
resource, and describe its initial evaluation to assist in 
computational linkage of medications to their 
indications. 

BACKGROUND 

A medication's indication is the disease or condi- 
tion for which it was prescribed in a given instance. 
Medications are typically prescribed without any 
structured record of indication in the EMR. In 
some cases, such as for medications prescribed 
using explicit order sets designed for a given diag- 
nosis, a human can infer the indication from the 
order or clinical documentation. However, in 
general, computational inference of a medication's 
indication from EMR data is difficult. For example, 
disease-specific order sets are primarily found only 
in the inpatient setting, are not comprehensive for 
all diagnoses, and do not assert the diagnosis with 
certainty (eg, a provider may use the pneumonia 
order set for convenience when in fact the patient 
has a different infection). 

Each medication can have many indications, and 
indications can be classified as either on-label or 
off-label. On-label indications are proposed in the 
early process of drug development by the manufac- 
turer and later approved by the Food and Drug 
Administration (FDA) after demonstrating efficacy 
through clinical trials. These on-label indications 
appear on the package insert for the medication. 
For example, metformin is FDA-approved to treat 
type 2 diabetes, and ampicillin may be prescribed 
to treat urinary tract infections, otitis media, or 
pneumonia. Many on-label indications can be 
retrieved freely from the FDA's DailyMed website. 7 
DailyMed currently contains drug labels for about 
40 994 brand and generic medications for both 
humans and animals. Off-label indications are con- 
ditions for which the medication is used, but which 
have not been approved by the FDA and do not 
appear on the package insert. Many medications 
have common off-label indications. 8 For example, 
metformin is used off-label to treat polycystic 
ovarian disease, 9 and ampicillin is used off-label for 
diverticulitis. 10 

Typically, off-label indications are based on scien- 
tific evidence found subsequent to the FDA approval 
process and collective physician experience. 11 By 
nature, off-label indications can be controversial, 
such as the use of statins (a class of cholesterol- 
lowering medications) for diabetes, regardless of the 
patient's cholesterol levels. 12 13 Evidence for off-label 
use may be scattered among various drug resources. 
Although some proprietary resources list both 
on-label and off-label indications (eg, Epocrates, 
FirstDataBank, and LexiComp), these resources are not freely 
available. Thus, it can be difficult to obtain a complete list of medi- 
cation indications, and using a single resource (especially 
DailyMed) may miss important or common indications. These 
resources are generally formatted as free text, and require extra 
processing to convert them into a computable format. 

Various medication resources have been created by leveraging 
either the EMR or literature for pharmaceutical research (eg, 
new drug discovery and adverse drug detection). 14 For instance, 
the Therapeutic Target Database (TTD) contains information 
about medications and their therapeutic targets and provides 
corresponding cross-links from the ClinicalTrials.gov database. 15 
However, TTD is designed for new drug discovery; most of its 
data are oriented for drug development instead of clinical use. 
Another resource, the Side Effect Resource (SIDER), was devel- 
oped using text-mining techniques applied to FDA-approved 
drug labels. SIDER provides a list of FDA-approved indications 
on marketed medications mined from FDA drug labels obtained 
from DailyMed. 16 Since its major focus is on side effects rather 
than indications, the indication list has not been thoroughly 
evaluated. Another important and relevant source is RxNorm, 
developed and maintained by the National Library of Medicine 
(NLM). 17 RxNorm is an ontology designed for exchanging 
medication information among clinical systems. It maintains a 
comprehensive list of commonly used medications (both generic 
and branded, with structured linkages between them), along 
with their forms, ingredients, and dosages. The integration of 
RxNorm with the National Drug File-Reference Terminology 
(NDF-RT) from the Veterans Health Administration has added 
significant indication information between single-ingredient 
medications and diseases through 'may_treat' and t may_prevent' 
therapeutic relationships. 17 NDF-RT includes both on-label and 
off-label indications, but its performance on indications has not 
been previously reported. Preliminary work with earlier versions 
of RxNorm and NDF-RT demonstrated that a number of medi- 
cations were lacking indications. 18 19 

In this paper, we propose an ensemble approach that 
combines multiple commonly used medication resources to 
create a computable drug resource, called MEDI. We believe 
that MEDI may assist in clinical applications within EMRs and 
the secondary use of EMR data. 

METHODS 
Data sources 

We selected four medication resources as inputs into MEDI. 
The four resources were: (1) RxNorm (downloaded on June 
4, 2012); (2) SIDER 2 (released on March 16, 2012), a public 
medication knowledge base targeting adverse drug reactions 
extracted from FDA drug labels; (3) MedlinePlus 
(http://www.nlm.nih.gov/medlineplus), an NLM-maintained website that 
offers consumer health information for patients, families, and 
healthcare providers; and (4) Wikipedia, an online collabora- 
tively edited encyclopedia. RxNorm and SIDER 2 maintain 
indication information within a formal table structure with 
structured (ie, coded) medication and indication information. 
MedlinePlus and Wikipedia are free-text based and required 
further processing (figure 1). 

Medication indication extraction 

We retrieved all single-ingredient medication concepts (repre- 
sented by RxNorm concept unique identifiers (RxCUIs) and 
defined as having only one 'has_ingredient' relationship), includ- 
ing clinical drugs and brand names from the prescribable subset 
of RxNorm. We also retrieved all ingredient RxCUIs from 
RxNorm, which were determined by term type (Term Type in 
Source (TTY)='IN', TTY='MIN', or TTY='PIN') and 
included both single-ingredient and multi-ingredient concepts. RxNorm 
covers almost all prescription medications currently marketed in 
the USA. Based on relationships within RxNorm, 17 all concepts 
were then collapsed into groups by their ingredients. For 
example, 'Tylenol Caplet, 325 mg oral tablet' (RxCUI 209387) 
was mapped to 'Acetaminophen' (RxCUI 161). 
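
The collapsing step can be pictured with a minimal sketch. It assumes the 'has_ingredient' relationships have already been loaded as (medication RxCUI, ingredient RxCUI) tuples; the function name and all rows except the Tylenol example are hypothetical and are not part of RxNorm or MEDI.

```python
from collections import defaultdict

def group_by_single_ingredient(has_ingredient_rows):
    """Group single-ingredient medication RxCUIs under their ingredient RxCUI.

    has_ingredient_rows: iterable of (medication_rxcui, ingredient_rxcui)
    tuples (an assumed, simplified view of RxNorm relationships).
    Medications with more than one ingredient are skipped.
    """
    ingredients = defaultdict(set)
    for med, ing in has_ingredient_rows:
        ingredients[med].add(ing)
    groups = defaultdict(set)
    for med, ings in ingredients.items():
        if len(ings) == 1:  # exactly one 'has_ingredient' relationship
            groups[next(iter(ings))].add(med)
    return dict(groups)

rows = [
    ("209387", "161"),     # Tylenol Caplet 325 mg -> Acetaminophen (from the text)
    ("900001", "161"),     # hypothetical second acetaminophen product
    ("900002", "161"),     # hypothetical combination product, ingredient 1
    ("900002", "900100"),  # hypothetical combination product, ingredient 2
]
groups = group_by_single_ingredient(rows)
```

In this toy input the combination product is dropped, and both single-ingredient products are grouped under RxCUI 161.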

To obtain indications of a medication from RxNorm, we 
retrieved all diseases that connect with the medication through 
either 'may_be_treated_by' or 'may_be_prevented_by' relation- 
ships. For SIDER 2, we mapped medications to corresponding 
RxCUIs where their brand names or drug names matched the 
terms associated with the RxCUIs. The mapping involved two 
steps: the first step was looking for exact matches and the 
second step was searching for partial matches if no exact match 
was found in step 1 — for example, 'salbutamol sulfate' was 
mapped to 'salbutamol'. We then retrieved all disease indications 
documented within the SIDER 2 indication table, which are 
mapped to Unified Medical Language System (UMLS) concepts. 
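
A minimal sketch of the two-step name matching follows. The text does not specify the partial-matching rule, so the rule used here (an RxNorm term appearing as a whole-word sequence inside the drug name, preferring the longest term) is an assumption, as are the function name and the placeholder identifier.

```python
def map_drug_name(name, rxnorm_terms):
    """Return the RxCUI for a drug name: exact match first, then partial match.

    rxnorm_terms: dict of lower-cased term string -> RxCUI (assumed input).
    The partial step is an illustrative rule, not the authors' rule.
    """
    key = name.strip().lower()
    if key in rxnorm_terms:  # step 1: exact match
        return rxnorm_terms[key]
    padded = f" {key} "
    # step 2: terms that occur as whole words inside the name
    candidates = [t for t in rxnorm_terms if f" {t} " in padded]
    if not candidates:
        return None
    return rxnorm_terms[max(candidates, key=len)]

# "rxcui-salbutamol" is a placeholder, not a real RxCUI.
terms = {"salbutamol": "rxcui-salbutamol"}
mapped = map_drug_name("Salbutamol sulfate", terms)
```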

To obtain medication indications from MedlinePlus, we first 
retrieved the webpage for each medication through the 
MedlinePlus Application Programming Interface (API) using the 
medication's RxCUI as the query input. The resulting Hypertext 
Markup Language (HTML) pages were parsed and stored as text 
files. MedlinePlus maintains a consistent document structure for 
its drug monographs, although the text within each section is 
free-text. We found that certain sections of MedlinePlus fre- 
quently contained drug indication information. We limited our 
analysis of the MedlinePlus description to the sections 'Why is 
this medication prescribed', 'About your treatment', and 'Other 
uses for this medicine', thus ignoring sections such as 'What side 
effects can this medication cause' and 'Precautions'. We used the 
KnowledgeMap Concept Indexer (KMCI) to parse the free-text 
to obtain all non-negated 'disease and finding' concepts, as 
mapped to UMLS concepts. KMCI is a general-purpose natural 
language processing (NLP) engine that maps free-text documents 
to UMLS concepts and includes negation detection through an 
adaptation of the NegEx algorithm. 20 KMCI has performed 
favorably in comparison with MetaMap 21 for medical school 
curriculum documents and has been validated in a variety of clin- 
ical and education contexts. 22-25 

To identify medication pages in Wikipedia, we queried the 
Wikipedia API with medication strings derived from RxNorm 
(querying with both brand and generic names for each drug). 
We used KMCI to identify non-negated disease concepts from 
the resulting Wikipedia pages as we did for MedlinePlus. 
However, since Wikipedia does not contain a formal structure 
clearly annotating medication indications, we employed heuris- 
tic rules. In perusal of Wikipedia entries, we noted that most 
medication entries listed indications before side effects, which 
were often listed in separate sections. Thus, we excluded any 
concepts found after a 'side-effect', 'safety', or 'toxicology' 
section. For entries that were just text based without being sepa- 
rated into sections, all content was parsed. 
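
The section heuristic can be sketched as follows. This is an illustration under stated assumptions: the page is assumed to have been split into (heading, body) tuples by an upstream parser, the heading pattern is a guess at how the three section names might be matched, and the stop section itself is excluded along with everything after it.

```python
import re

# Assumed pattern for 'side-effect', 'safety', or 'toxicology' headings.
STOP_SECTION = re.compile(r"side[\s-]?effect|safety|toxicology", re.IGNORECASE)

def text_before_stop_section(sections):
    """Keep section text up to, not including, the first stop section.

    sections: list of (heading, body) tuples in page order. A page with
    no headings arrives as a single ("", body) tuple and is kept whole.
    """
    kept = []
    for heading, body in sections:
        if STOP_SECTION.search(heading):
            break
        kept.append(body)
    return "\n".join(kept)

# Hypothetical page content for illustration only.
page = [
    ("", "Drug X is used to treat hypertension."),
    ("Medical uses", "It is also used for angina."),
    ("Side effects", "May cause dizziness."),
    ("History", "Discovered in 1970."),
]
indication_text = text_before_stop_section(page)
```

Only the retained text would then be passed to the concept indexer.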

All disease concepts extracted from MedlinePlus or Wikipedia 
were initially represented as UMLS concepts. After processing 
by KMCI, concepts were restricted to those that could be 
mapped to International Classification of Diseases, Ninth 
Revision, Clinical Modification (ICD9) codes using UMLS rela- 
tionships (as defined in MRREL), and thus included diseases, 
syndromes, symptoms, and other clinical findings. Both the ori- 
ginal concept and the resultant ICD9 concept were kept for 
MEDI. ICD9 codes were chosen since these are commonly 
available codes within most EMR systems. For those 
SNOMED-CT concepts that could not be directly mapped to 
ICD9 codes through UMLS relationships, we used the 
SNOMED-CT ICD9 CrossMap 26 to map them into corre- 
sponding ICD9 codes, where possible. We only used relation- 
ships with map advice equal to 1 (one-to-one SNOMED-CT to 
ICD9 map) or 2 (Narrow to Broad SNOMED-CT to ICD9 
map). 

Figure 1 Flowchart for MEDication-Indication (MEDI) creation. HPS, 
high-precision subset; ICD9, International Classification of Diseases, 
9th edition; KMCI, KnowledgeMap Concept Indexer; RxCUI, RxNorm 
concept unique identifier; SIDER, Side Effect Resource. 

Evaluation 

We categorized each medication-indication pair by the combin- 
ation of resources in which it was found (RxNorm alone, 
RxNorm and SIDER 2, etc). Each category, represented by a row 
in table 1, is one of 15 possible combinations of our four sources. 
Each source is positive for eight of these combinations, meaning 
that all medication-indication pairs in that category were found 
in the resource (these are indicated by a 'Y' in table 1). 
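
The category counts stated above (15 combinations, each source positive for eight) can be checked with a short sketch. The names `SOURCES`, `categorize`, and `positive_for` are illustrative only.

```python
from itertools import product

SOURCES = ("RxNorm", "MedlinePlus", "Wikipedia", "SIDER 2")

def categorize(pair_sources):
    """Return the category key (a tuple of Y/N flags) for one pair."""
    return tuple("Y" if s in pair_sources else "N" for s in SOURCES)

# All non-empty combinations of the four sources.
categories = [c for c in product("YN", repeat=len(SOURCES)) if "Y" in c]

# For each source, the categories in which it is positive.
positive_for = {
    s: [c for c in categories if c[i] == "Y"] for i, s in enumerate(SOURCES)
}
```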

We calculated the true positive rate for each category by 
manually evaluating 50 randomly selected medication-indica- 
tion pairs per category. Two practicing physicians (JCD and 
RMC) each reviewed the indications independently, and differ- 
ences were resolved by consensus. Physicians used clinical 
experience, search of drug resources and medical references, 
and web and PubMed searching to determine the veracity of 
medication-indication pairs. 

We estimated precision and recall of each resource, r, using 
equations (1) and (2), where C(r) is the set of eight categories 
for which the resource, r, is positive, size(n) is the number of 
medication-indication pairs in category n, and TPR(n) is the 
true positive rate for category n. 



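$$\mathrm{Precision}(r) = \frac{\sum_{n \in C(r)} \mathrm{size}(n)\cdot \mathrm{TPR}(n)}{\sum_{n \in C(r)} \mathrm{size}(n)} \tag{1}$$

$$\mathrm{Recall}(r) = \frac{\sum_{n \in C(r)} \mathrm{size}(n)\cdot \mathrm{TPR}(n)}{\sum_{\text{all } n} \mathrm{size}(n)\cdot \mathrm{TPR}(n)} \tag{2}$$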



These equations estimate the standard precision and recall 
measures for each resource, r, but they do so using stratified 
sampling over the categories. 
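
As a worked illustration of equations (1) and (2), the sketch below recomputes the estimates from the category sizes and precision percentages reported in table 1. Two caveats apply: several Y/N flags were lost in the extracted copy of table 1 and are inferred here from the per-resource totals in table 2, so they should be treated as an assumption; and the rounded percentage is used as TPR(n), so results can differ from the published figures by about one percentage point.

```python
# Key: flags in the order (RxNorm, MedlinePlus, Wikipedia, SIDER 2).
# Value: (category size, precision % from table 1, used as TPR).
TABLE1 = {
    "NNNY": (10880, 51), "NNYN": (28323, 47), "NYNN": (10836, 65),
    "YNNN": (3683, 89),
    "NNYY": (1592, 77), "NYNY": (1464, 92), "NYYN": (1142, 89),
    "YNNY": (813, 98), "YNYN": (868, 100), "YYNN": (381, 95),
    "NYYY": (1066, 98), "YNYY": (603, 96), "YYNY": (375, 96),
    "YYYN": (408, 97),
    "YYYY": (909, 100),
}
SOURCES = ("RxNorm", "MedlinePlus", "Wikipedia", "SIDER 2")

def estimate(resource):
    """Return (precision, recall) for one resource per equations (1) and (2)."""
    idx = SOURCES.index(resource)
    positive = [v for k, v in TABLE1.items() if k[idx] == "Y"]  # C(r)
    true_pos = sum(size * pct / 100 for size, pct in positive)
    all_true_pos = sum(size * pct / 100 for size, pct in TABLE1.values())
    precision = true_pos / sum(size for size, _ in positive)
    recall = true_pos / all_true_pos
    return precision, recall

rx_precision, rx_recall = estimate("RxNorm")
total_pairs = sum(size for size, _ in TABLE1.values())
rx_pairs = sum(size for k, (size, _) in TABLE1.items() if k[0] == "Y")
```

With these inputs the RxNorm estimates round to the 94% precision and 20% recall reported in the Results, and the category sizes sum to the 63 343 pairs in MEDI and the 8040 pairs in RxNorm.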

To demonstrate that MEDI has a broader coverage than 
RxNorm, we compared the indications in MEDI with the indi- 
cations in RxNorm within the context of cancer. Cancer was 
chosen because it is a broad group of important diseases easily 
identified through a single set of ICD9 codes (140-239) 
covered primarily by prescription medications with well-defined 
indications. We compared MEDI with RxNorm, and then ran- 
domly selected a few medication-indication pairs for validation. 
We validated through use of general medicine resources such as 
UpToDate and PubMed searches. 
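
Selecting cancer indications by ICD9 range can be sketched as below. The (RxCUI, ICD9 code) tuple layout and both function names are assumptions made for illustration; V and E codes are treated as outside the numeric range.

```python
def is_neoplasm_icd9(code):
    """True if an ICD9 code falls in the 140-239 range used in the text.

    V and E codes (eg, 'V10.3') have non-numeric categories and are
    treated as outside the range.
    """
    category = code.strip().split(".")[0]
    return category.isdigit() and 140 <= int(category) <= 239

def cancer_medications(pairs):
    """pairs: iterable of (rxcui, icd9_code) tuples (assumed layout)."""
    return {rxcui for rxcui, icd9 in pairs if is_neoplasm_icd9(icd9)}

# Hypothetical pairs; the identifiers are placeholders.
example = cancer_medications([("drug_a", "174.9"), ("drug_b", "401.9")])
```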
Table 1 Validation results by two reviewers 

| Found in | RxNorm | MedlinePlus | Wikipedia | SIDER 2 | Size | Pairs reviewed | False positives | Precision (%) |
|---|---|---|---|---|---|---|---|---|
| 1 resource | N | N | N | Y | 10880 | 49 | 24 | 51 |
| 1 resource | N | N | Y | N | 28323 | 45 | 24 | 47 |
| 1 resource | N | Y | N | N | 10836 | 46 | 16 | 65 |
| 1 resource | Y | N | N | N | 3683 | 35 | 4 | 89 |
| 2 resources | N | N | Y | Y | 1592 | 35 | 8 | 77 |
| 2 resources | N | Y | N | Y | 1464 | 49 | 4 | 92 |
| 2 resources | N | Y | Y | N | 1142 | 47 | 5 | 89 |
| 2 resources | Y | N | N | Y | 813 | 40 | 1 | 98 |
| 2 resources | Y | N | Y | N | 868 | 31 | 0 | 100 |
| 2 resources | Y | Y | N | N | 381 | 39 | 2 | 95 |
| 3 resources | N | Y | Y | Y | 1066 | 48 | | 98 |
| 3 resources | Y | N | Y | Y | 603 | 46 | | 96 |
| 3 resources | Y | Y | N | Y | 375 | 56 | | 96 |
| 3 resources | Y | Y | Y | N | 408 | 39 | 1 | 97 |
| All 4 resources | Y | Y | Y | Y | 909 | 65 | 0 | 100 |

N, no; SIDER, Side Effect Resource; Y, yes. 



RESULTS 

Medication indication extraction 

From 93 479 unique concepts in the RxNorm prescribable 
table, we retrieved 61 450 medication concepts that could be 
mapped to a single ingredient concept (via having only one 
'has_ingredient' RxCUI relationship) in RxNorm. These 
61 450 single-ingredient medication concepts, which include 
brand names and various clinical drug forms, were then 
grouped into 4003 unique RxNorm medication ingredients (ie, 
TTY='IN', TTY='MIN', or TTY='PIN'). Of these 4003 
medication ingredients, 3112 (78%) had at least one indication 
extracted from at least one of the four resources, and 2114 
(53%) had indications extracted from at least two (figure 2, left). 
The 891 medication ingredients without any indication 
extracted from any resource were typically not medications — 
for example, 'kiwi allergenic extract' (RxCUI 1010926), 'lime' 
(RxCUI 1011060), and 'sugar cane extract' (RxCUI 1014711). 

From the 3112 medications with indications, we identified 
3009 unique ICD9 codes and 63 343 indication pairs (one 
RxCUI and one ICD9 code) (table 2). The mean±SD number 
of extracted indications for each medication was 20.35±22.00, 
and the mode was 3. Of these 63 343 indication pairs, 53 722 
(85%) were found within a single resource, and 9621 (15%) 
were found within two or more resources (figure 2, right). 

Figure 2 Weighted Venn diagram of the distribution of 3112 medications (left) and 63 343 indication pairs (right) within the four resources. Each 
border color represents a resource. Different colored areas represent medications-indications that were found within different combinations of 
resources. The area sizes surrounded by border color(s) are proportional to the number of medications-indications that were found within the 
corresponding resource(s). SIDER, Side Effect Resource. 

Table 2 Number of unique medications, ICD9 codes, and indication pairs extracted from each resource 

| Resource | Medications (% of total) | ICD9 codes (% of total) | Indication pairs (% of total) |
|---|---|---|---|
| RxNorm | 1726 (56) | 999 (33) | 8040 (13) |
| SIDER 2 | 1554 (50) | 1703 (57) | 17702 (28) |
| MedlinePlus | 1629 (52) | 869 (29) | 16581 (26) |
| Wikipedia | 2608 (84) | 2624 (87) | 34911 (55) |
| Union of all resources | 3112 | 3009 | 63343 |

SIDER, Side Effect Resource. 

Medication indication validation 

We initially chose ~700 medication-indication pairs for review 
(~50 pairs from each category) before we noticed that some of 
the disease concepts retrieved by NLP were older CUIs that 
overlapped with RxNorm concepts but appeared as duplicates 
in the initial review. We then mapped older (deleted) UMLS 
concepts to current UMLS concepts using the UMLS history 
files; therefore, the number of medication-indication pairs 
reviewed in each category was not equal. A total of 689 
medication-indication pairs were finally reviewed for validation. 
Among these 689 indications, 19 were marked as uncertain (eg, 
ondansetron and irritable bowel syndrome, albumin and dehy- 
dration, estradiol and other malaise and fatigue) because 
both reviewers agreed that they did not accurately represent 
the true indication but were not clearly false indica- 
tions. We therefore excluded these uncertain pairs from subsequent 
analyses. As shown in table 1, the precision was 100% for indi- 
cations found within all four resources; precisions were above 
95% for indications found within any three of the four 
resources. Precision was near 90% for indications found within 
two resources, except for indications found only in SIDER 2 
and Wikipedia (a precision of 77%). For indications found 
within only Wikipedia, SIDER 2, or MedlinePlus, the precision 
dropped to 47-65%. In contrast, indications found only within 
RxNorm still had high precision (89%). Table 3 shows a 
random selection of some of the errors from each resource. 

Table 4 presents the estimated recall and precision for each 
resource as defined by equations 1 and 2. We found that 
RxNorm gave a remarkably high precision (94%) but a rela- 
tively low recall (20%). Compared with RxNorm, other 
resources had lower precision (56-75%) but higher recall (31- 
51%). Wikipedia achieved the best recall (51%) among the four 
resources. The differences between RxNorm and the other 
three resources in terms of recall and precision may be 
explained by the fact that the RxNorm indications were already 
curated and stored in a structured format, while the indications 
from the other three resources were based on either NLP 
concept retrieval or text mining techniques. 

Currently, MEDI contains a total of 3112 medications and 
63 343 medication-indication pairs found within the four 
sources. From our validation results, we observed that high pre- 
cision could be achieved with indications found within any two 
(or more) of the four resources. Indications solely within 
RxNorm also had a high precision. Thus, to optimize recall 
while maintaining reasonable precision, we defined the MEDI 
high-precision subset (MEDI-HPS) as the indications found 
within either at least two of the four resources or RxNorm. The 
current version of MEDI-HPS contains 13 304 unique indica- 
tion pairs for 2136 medications. The mean±SD number of indica- 
tions for each medication is 6.22±6.09; the mode is 2 and the 
median is 4. Examples of four commonly used medications 
from MEDI are provided in online 
supplementary table S1. 
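
The MEDI-HPS membership rule can be written as a one-line predicate. The sketch below assumes each pair carries the set of resource names in which it was found; the example pairs and names are hypothetical.

```python
def in_high_precision_subset(sources):
    """sources: set of resource names in which a pair was found."""
    return "RxNorm" in sources or len(sources) >= 2

# Hypothetical pairs keyed by (medication, ICD9 code).
pairs = {
    ("drug_a", "401.9"): {"RxNorm"},
    ("drug_b", "493"): {"Wikipedia"},
    ("drug_c", "250.00"): {"MedlinePlus", "SIDER 2"},
}
hps = {pair for pair, found_in in pairs.items() if in_high_precision_subset(found_in)}
```

Applied to the category sizes in table 1, this rule selects the 9621 pairs found in two or more resources plus the 3683 pairs found only in RxNorm, which matches the 13 304 pairs reported for MEDI-HPS.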

MEDI-HPS offers a comparable number of indications but a 
much higher precision (92%) than MedlinePlus (75%), SIDER 
2 (67%), or Wikipedia (56%). Wikipedia, as the only uncon- 
trolled resource we utilized, contributes the largest number of 
indications, but also has the lowest precision. MEDI-HPS has 
slightly lower precision (92%) than either RxNorm (94%) or 
≥2 resources (93%) because the precision of medication-indica- 
tion pairs found only within RxNorm is 89%. 

Compared with RxNorm, MEDI-HPS maintains a similarly high 
precision (92%) but provides 5264 (66%) more indications. To 
demonstrate this advantage, we compared the coverage of 
MEDI-HPS with RxNorm within the context of cancer medica- 
tions. We retrieved all medications that have cancer (ICD9 codes 
140-239) as an indication. MEDI-HPS included 269 cancer medi- 
cations while RxNorm only had 166. A total of 103 (38%) medi- 
cations were absent in RxNorm, including plerixafor, romidepsin, 
raloxifene, pralatrexate, and eribulin, all valid cancer drugs 
whose indications listed in MEDI-HPS were validated through lit- 
erature review. 

Table 3 Selected example errors from each resource 

| Resource | Medication (RxCUI) | Disease (ICD9) | Comment |
|---|---|---|---|
| RxNorm | Captopril (1998) | Rheumatoid arthritis (714.0) | The indication was supported by a small case series in 1984 27 but has not been widely accepted thereafter |
| RxNorm | Isosorbide (6057) | Esophageal reflux (530.81) | Isosorbide can be used for esophageal spasm, but may cause reflux |
| MedlinePlus | Sildenafil (136411) | Other malaise and fatigue (780.79) | NLP falsely identified a concept that is irrelevant to an indication: 'Sildenafil is used to improve the ability to exercise in people with pulmonary arterial hypertension (PAH; high blood pressure in the vessels carrying blood to the lungs, causing shortness of breath, dizziness, and tiredness)' |
| MedlinePlus | Dexmethylphenidate (352372) | Other specified visual disturbances (368.8) | Mismatched disease concept by NLP: 'Dexmethylphenidate is used as part of a treatment program to control symptoms of attention deficit hyperactivity disorder (ADHD; more difficulty focusing,...) in adults and children' |
| Wikipedia | Ciprofloxacin (2551) | Cystic fibrosis (277.0) | However, the fluoroquinolones are licensed to treat lower respiratory infections in children with cystic fibrosis in the UK |
| Wikipedia | Guaifenesin (5032) | Asthma (493) | Guaifenesin is claimed to be effective in the treatment of the thickened bronchial mucosa characteristic of asthma |
| SIDER 2 | Dobutamine (3616) | Atrial fibrillation (427.31) | Contraindication/side effect |
| SIDER 2 | Ephedrine (3966) | Hypertension NOS (401.9) | Contraindication/side effect |

NOS, not otherwise specified; NLP, natural language processing; RxCUI, RxNorm concept unique identifier; SIDER, Side Effect Resource. 

Table 4 Estimated precision and recall for different resources 

| Resource | Medications | Indication pairs | Precision (%) | Recall (%) |
|---|---|---|---|---|
| RxNorm | 1726 | 8040 | 94 | 20 |
| MedlinePlus | 1629 | 16581 | 75 | 33 |
| Wikipedia | 2608 | 34911 | 56 | 51 |
| SIDER 2 | 1554 | 17702 | 67 | 31 |
| 4 resources | 433 | 909 | 100 | 2 |
| ≥3 resources | 1108 | 3361 | 98 | 9 |
| ≥2 resources | 1847 | 9621 | 93 | 23 |
| MEDI (≥1 resource) | 3112 | 63343 | 60 | 100 |
| MEDI-HPS (≥2 or RxNorm) | 2136 | 13304 | 92 | 30 |

HPS, high-precision subset; MEDI, medication indication resource; SIDER, Side Effect 
Resource. 

MEDI is available in a comma-separated values file format. 
The file consists of medications represented by RxCUIs and 
indications mapped to UMLS CUIs and ICD9 codes, as well as 
other metadata including a column called 'possible_label_use'. 
The value of 'possible_label_use' is 1 when the indication is 
mentioned in SIDER 2 and 0 when it is not. Our assumption is 
that, since SIDER 2 is extracted from drug labels, indications 
mentioned within SIDER 2 are highly likely to be on-label (ie, 
FDA-approved) uses. 
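
A minimal sketch of reading such a file is given below. Only the column name 'possible_label_use' comes from the text; the other column names, the sample rows, and the placeholder concept identifiers are assumptions for illustration and may not match the distributed file.

```python
import csv
import io

# Hypothetical excerpt; every column name except 'possible_label_use' is assumed.
SAMPLE = """rxcui,icd9,umls_cui,possible_label_use
161,780.6,CUI_PLACEHOLDER_1,1
161,346.9,CUI_PLACEHOLDER_2,0
"""

def load_medi(handle):
    """Return {rxcui: [(icd9, possible_label_use as bool), ...]}."""
    by_drug = {}
    for row in csv.DictReader(handle):
        entry = (row["icd9"], row["possible_label_use"] == "1")
        by_drug.setdefault(row["rxcui"], []).append(entry)
    return by_drug

medi = load_medi(io.StringIO(SAMPLE))
```

Passing an open file handle instead of the in-memory sample would load the full resource in the same way.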

DISCUSSION 

By leveraging existing public resources, ontologies, and NLP, we 
created a computable medication indication resource for both 
on- and off-label indications that is mapped to standard billing 
codes and structured vocabularies. The current version of MEDI 
contains 63 343 medication-indication pairs for 3112 medica- 
tions. MEDI-HPS, the high precision subset of MEDI, provides 
13 304 indication pairs for 2136 medications. Compared with each of 
RxNorm, SIDER 2, MedlinePlus, and Wikipedia alone, MEDI-HPS 
had better precision or better recall. MEDI (and future 
resources like it) may facilitate computational linkage of pre- 
scriptions with their indications, enabling both clinical and 
research use of EMR data. 

The adoption of EMRs has been rapidly expanding in the 
USA since 2008, especially after the passing of the Health 
Information Technology for Economic and Clinical Health 
(HITECH) Act. 28 Requirements such as maintaining structured 
lists of problems, medications and allergies, and electronic pre- 
scribing are key components of meaningful use stage one. 29 The 
continuing accumulation of EMR data will present unprece- 
dented opportunities for clinical research. However, the 'infor- 
mation gap' between medications and diseases precludes the 
efficient use of these practice-based medication data, hindering 
the primary and secondary use of EMRs. MEDI and tools like it 
may begin to fill that gap. 

MEDI may be useful in current phenotype algorithms 30-33 or 
for future deep phenotyping, 5 6 34 35 both of which require 
detailed clinical data to accurately classify patients into subpopu- 
lations with respect to a disease, a phenotypic subclass of a 
disease, or a response to a treatment. For instance, medications 
were used in addition to ICD9 codes in EMR phenotype algorithms 
for type 2 diabetes mellitus, 31 33 36 Crohn's disease, 32 rheuma- 
toid arthritis, 30 and many of the other algorithms deployed in 
the Electronic Medical Records and Genomics (eMERGE) 
Network to identify cases and controls for genome-wide associ- 
ation studies. 37-39 MEDI may also improve the accuracy of the 
detection of adverse drug reactions 40 and elevate the quality 
and utility of the EMR problem lists. 41 In addition, tools such 
as MEDI may also improve the precision of phenome-wide 
EMR phenotyping methods, such as the ICD9-based phenome- 
wide association studies (PheWAS) method, by allowing integra- 
tion of two axes of clinical information. 38 42 

Wikipedia is one of the most commonly visited websites in 
the world, but it has rarely been evaluated in the medical litera- 
ture, owing possibly in part to being an uncontrolled source 
with uncontrolled structure, challenging its use in medical appli- 
cations. A small study in 2005 reported that Wikipedia had 
similar accuracy to Encyclopedia Britannica. 43 Our study shows 
that Wikipedia's recall on indications is substantially higher than 
that of RxNorm, SIDER 2, and MedlinePlus. In addition, we 
noted that Wikipedia contains a number of homeopathic/alter- 
native medications and treatments (eg, parsley and nephrolithia- 
sis) that are not in other resources. Homeopathic medications, 
because they are not RxNorm prescribable medications, are not 
included in the current version of MEDI, but may be included 
in the future. 

One source of error and possible area for improvement is the 
mapping of indications from free-text resources. NLP-induced 
errors were largely caused by a mismatched disease concept, a 
failure to recognize negation, or a failure to identify that a 
concept was actually a side effect/complication. For instance, in 
the sentence 'the process is called starch gelatinization', KMCI 
falsely identified a disease concept, CALL (precursor B-cell 
lymphoblastic leukemia, CUI C1292769), based on the normal- 
ization of the past participle 'called' to 'call', and the mapping 
of 'call' to the acronym 'CALL'. This category of error may be 
resolved by disallowing mappings between normalized strings 
and UMLS acronyms. In another sentence, 'Adefovir will not 
cure hepatitis B and may not prevent complications of chronic 
hepatitis B such as cirrhosis of the liver or liver cancer', KMCI 
failed to recognize that the 'liver cancer' was negated because of 
the distance between the subject and the target term. We hope 
to correct these classes of error in future work. 
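
The proposed acronym safeguard might look like the sketch below. It is not KMCI code: the function name, the all-uppercase test for acronyms, and the length cutoff are assumptions chosen only to illustrate the idea of rejecting matches that arise from normalizing an ordinary word.

```python
def allow_mapping(source_token, umls_string, max_acronym_len=6):
    """Reject matches where an ordinary word is mapped to an acronym.

    source_token: the token as it appears in the text, before normalization.
    umls_string: the candidate UMLS string it was matched to.
    An acronym (assumed here: short, all uppercase) is accepted only if
    the source token matches it exactly, including case.
    """
    is_acronym = umls_string.isupper() and len(umls_string) <= max_acronym_len
    if is_acronym:
        return source_token == umls_string
    return True

blocked = allow_mapping("called", "CALL")  # the error described in the text
```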

MEDI does not replace existing commercially available 
resources. Many commercially available resources (eg, Epocrates 44 
and LexiComp 45 ) provide not only drug-indication pairing but 
also dose guidance that can be tailored to indication, drug formu- 
lations, international brand names, safety warnings, and adverse 
reactions; none of this information is currently provided in MEDI 
(nor is it the goal of MEDI to be a comprehensive prescribing 
guide). However, medication and indications are represented by 
formal concepts and billing codes in MEDI rather than embedded 
in free text as in Epocrates or LexiComp. Conceptual formaliza- 
tion should facilitate research and application creation. A random 
review of a few common medications noted that, for some, MEDI 
may include more indications than commercial resources. For 
example, MEDI-HPS shows that propranolol can be used for con- 
gestive heart failure, panic disorder, and thyrotoxicosis (each a clin- 
ically valid use not listed in Epocrates, and only one of which, 
thyrotoxicosis, is listed in LexiComp) in addition to hypertension, 
migraines, angina, myocardial infarction, pheochromocytoma, 
arrhythmias, and essential tremor, all of which are listed by all 
resources. However, MEDI shows that metformin can be used to 
treat obesity, which has been supported in research trials 46 47 but 
is not a common indication. Thus, since MEDI is not a curated 
resource, some of its listed indications will likely be false positives 
compared with commercial resources. 

Several limitations regarding the creation and evaluation of 
MEDI should be clarified. First, MEDI is limited to medications 
and indications found in those four resources. Although the 
resulting precision is encouraging, the addition of other 
resources may improve both recall and precision. Second, MEDI 
currently primarily includes medications composed of a single 
ingredient (97.7%); only 2.3% were multi-ingredient concepts. 
Thus, MEDI probably does not include all prescription medica- 
tions on the market today, which may be especially true for 
combination medications. Users should be careful about this 
limitation when they utilize MEDI or MEDI-HPS to conduct 
research or create applications. Third, we have not made any 
judgments about the strength of evidence for off-label uses 
when building MEDI; this information is not easily found. In 
addition, we estimated recall and precision for each vocabulary, 
but they could be skewed because of the resources we chose and 
the artifacts of NLP. Finally, we lacked a complete list of indica- 
tions or a true gold standard to evaluate recall. In our analysis, 
recall was estimated using the assumption that the true positive 
drug-indication pairs from all sources represented the universe 
of all possible drug-indication pairs. Thus, true recall is likely 
lower since our method could not detect indications not listed 
in one of the four resources. 

This paper introduces our initial efforts to create MEDI. Future 
work can make this resource more complete and more robust. For 
example, further elucidation of off-label status and drug form 
information would improve MEDI's clinical usability. In addition, 
medications are not all used with equal frequency. Evaluation of 
indication prevalence using EMR data may aid identification of 
new medication-indication pairs and improve accuracy. 

CONCLUSION 

In summary, MEDI is a freely available, computable medication 
indication resource that is more comprehensive than existing 
freely available resources. Because it uses UMLS concepts 
and ICD9 codes, MEDI can be easily used in conjunction with 
billing codes or concepts extracted from free text using NLP. 
Our results demonstrate its broad coverage and high accuracy. 
MEDI may enable research and clinical EMR applications. 
MEDI is freely available at 
http://knowledgemap.mc.vanderbilt.edu/research/content/MEDI. 
We plan to update MEDI periodically as component resources are revised. 